Data Wrangling
10 February, 2024
Please install and load the following packages
Access lecture slide from the course landing page
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics
I am a RStudio (Posit) certified tidyverse Instructor.
I am a Researcher at Oxford Poverty and Human development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
For this lecture, we will be using the census dataset from openintro package
Taking a peek at the data
# A tibble: 6 × 8
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 22800 20 Female White
3 2000 Florida 0 20 Male Black
4 2000 Florida 23000 6 Female White
5 2000 Florida 48000 55 Male White
6 2000 Florida 74000 43 Female White
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>
Alternatively, you can look at the bottom 6 rows
# A tibble: 6 × 8
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Virginia 45100 5 Male White
2 2000 Virginia 8500 68 Female White
3 2000 Virginia 55400 12 Male White
4 2000 Virginia 15000 60 Male Black
5 2000 Virginia NA 60 Male Black
6 2000 Virginia 39370 47 Male White
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>
Select only year, state, family income, age, sex and personal income
select(census_data,
census_year, state_fips_code,
total_family_income, age, sex,
total_personal_income)# A tibble: 500 × 6
census_year state_fips_code total_family_income age sex
<int> <fct> <int> <int> <fct>
1 2000 Florida 14550 44 Male
2 2000 Florida 22800 20 Female
3 2000 Florida 0 20 Male
4 2000 Florida 23000 6 Female
5 2000 Florida 48000 55 Male
6 2000 Florida 74000 43 Female
7 2000 Florida 23000 60 Female
8 2000 Florida 74000 47 Female
9 2000 Florida 60000 54 Female
10 2000 Florida 14600 58 Female
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>
Alternatively, we could choose what we do not want
# A tibble: 500 × 6
census_year state_fips_code total_family_income age sex
<int> <fct> <int> <int> <fct>
1 2000 Florida 14550 44 Male
2 2000 Florida 22800 20 Female
3 2000 Florida 0 20 Male
4 2000 Florida 23000 6 Female
5 2000 Florida 48000 55 Male
6 2000 Florida 74000 43 Female
7 2000 Florida 23000 60 Female
8 2000 Florida 74000 47 Female
9 2000 Florida 60000 54 Female
10 2000 Florida 14600 58 Female
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>
Also, data types can be selected
# A tibble: 500 × 4
state_fips_code sex race_general marital_status
<fct> <fct> <fct> <fct>
1 Florida Male Two major races Married/spouse present
2 Florida Female White Never married/single
3 Florida Male Black Never married/single
4 Florida Female White Never married/single
5 Florida Male White Married/spouse present
6 Florida Female White Married/spouse present
7 Florida Female White Married/spouse present
8 Florida Female White Married/spouse present
9 Florida Female Black Married/spouse present
10 Florida Female White Widowed
# ℹ 490 more rows
Arranging the data with states in the alphabetical order
# A tibble: 500 × 8
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Alabama 5500 73 Female Black
2 2000 Alabama 63820 40 Male White
3 2000 Alabama 11200 60 Female Black
4 2000 Alabama 34500 43 Female Other
5 2000 Alabama 33600 7 Male White
6 2000 Arizona 32500 9 Female White
7 2000 Arizona 46800 53 Female White
8 2000 Arizona 30000 60 Female White
9 2000 Arizona 0 67 Female White
10 2000 Arizona 51000 27 Male White
# ℹ 490 more rows
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>
Can you guess how would you arrange in the descending order?
# A tibble: 500 × 8
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 22800 20 Female White
3 2000 Florida 0 20 Male Black
4 2000 Florida 23000 6 Female White
5 2000 Florida 48000 55 Male White
6 2000 Florida 74000 43 Female White
7 2000 Florida 23000 60 Female White
8 2000 Florida 74000 47 Female White
9 2000 Florida 60000 54 Female Black
10 2000 Florida 14600 58 Female White
# ℹ 490 more rows
# ℹ 2 more variables: marital_status <fct>, total_personal_income <int>
smoking from the openintro package.Making the variables names more cleaner
# A tibble: 500 × 8
census_year state total_family_income age sex race_general marital_status
<int> <fct> <int> <int> <fct> <fct> <fct>
1 2000 Flor… 14550 44 Male Two major r… Married/spous…
2 2000 Flor… 22800 20 Fema… White Never married…
3 2000 Flor… 0 20 Male Black Never married…
4 2000 Flor… 23000 6 Fema… White Never married…
5 2000 Flor… 48000 55 Male White Married/spous…
6 2000 Flor… 74000 43 Fema… White Married/spous…
7 2000 Flor… 23000 60 Fema… White Married/spous…
8 2000 Flor… 74000 47 Fema… White Married/spous…
9 2000 Flor… 60000 54 Fema… Black Married/spous…
10 2000 Flor… 14600 58 Fema… White Widowed
# ℹ 490 more rows
# ℹ 1 more variable: total_personal_income <int>
smoking dataset, rename the highest_qualification variable as educationFiltering ages more than 40
# A tibble: 212 × 9
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 48000 55 Male White
3 2000 Florida 74000 43 Female White
4 2000 Florida 23000 60 Female White
5 2000 Florida 74000 47 Female White
6 2000 Florida 60000 54 Female Black
7 2000 Florida 14600 58 Female White
8 2000 Florida 37000 51 Female White
9 2000 Florida 32000 62 Female White
10 2000 Florida 100100 44 Female White
# ℹ 202 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>
Filtering only females
# A tibble: 232 × 9
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 22800 20 Female White
2 2000 Florida 23000 6 Female White
3 2000 Florida 74000 43 Female White
4 2000 Florida 23000 60 Female White
5 2000 Florida 74000 47 Female White
6 2000 Florida 60000 54 Female Black
7 2000 Florida 14600 58 Female White
8 2000 Florida 0 33 Female White
9 2000 Florida 37000 51 Female White
10 2000 Florida 32000 62 Female White
# ℹ 222 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>
Filtering cases where age is above 40 AND sex is female
# A tibble: 100 × 9
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 74000 43 Female White
2 2000 Florida 23000 60 Female White
3 2000 Florida 74000 47 Female White
4 2000 Florida 60000 54 Female Black
5 2000 Florida 14600 58 Female White
6 2000 Florida 37000 51 Female White
7 2000 Florida 32000 62 Female White
8 2000 Florida 100100 44 Female White
9 2000 Florida 43950 69 Female White
10 2000 Florida 31600 80 Female White
# ℹ 90 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>
Filtering cases where age is above 40 OR sex is female
# A tibble: 344 × 9
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 22800 20 Female White
3 2000 Florida 23000 6 Female White
4 2000 Florida 48000 55 Male White
5 2000 Florida 74000 43 Female White
6 2000 Florida 23000 60 Female White
7 2000 Florida 74000 47 Female White
8 2000 Florida 60000 54 Female Black
9 2000 Florida 14600 58 Female White
10 2000 Florida 0 33 Female White
# ℹ 334 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>
Filtering cases where age is NOT a female
# A tibble: 268 × 9
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 0 20 Male Black
3 2000 Florida 48000 55 Male White
4 2000 Florida 113000 8 Male White
5 2000 Florida 76900 25 Male White
6 2000 Florida 48000 1 Male White
7 2000 Florida 57200 31 Male White
8 2000 Florida 49000 31 Male White
9 2000 Florida 50090 2 Male White
10 2000 Florida 64800 47 Male White
# ℹ 258 more rows
# ℹ 3 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>
smoking dataset, filter males- Fortunately, there is good news
- Introducing (drumrolls) pipe
- The pipe operator is written as %>% or |>
- Relies on a step-wise logic
- Useful in making the code more readable
- Eliminates the possibilities of making mistakes in code writing
Original Code
Finding the total personal income by sex
Can you similarly calculate the total personal income by state?
smoking dataset, select only the highest_qualification and smoke_weekdays variables, rename the variables appropriately and calculate the total number of cigarettes smoke by people in each qualificationcase_whencase_when() is used to create a new variable using conditional logic
We want to convert income into income categories
census_data |>
mutate(
income_category = case_when(
total_family_income >= 0 &
total_family_income < 40000 ~ "Low Income",
total_family_income >= 40000 &
total_family_income < 100000 ~ "Middle Income",
total_family_income >= 100000 ~ "High Income")
)# A tibble: 500 × 10
census_year state_fips_code total_family_income age sex race_general
<int> <fct> <int> <int> <fct> <fct>
1 2000 Florida 14550 44 Male Two major races
2 2000 Florida 22800 20 Female White
3 2000 Florida 0 20 Male Black
4 2000 Florida 23000 6 Female White
5 2000 Florida 48000 55 Male White
6 2000 Florida 74000 43 Female White
7 2000 Florida 23000 60 Female White
8 2000 Florida 74000 47 Female White
9 2000 Florida 60000 54 Female Black
10 2000 Florida 14600 58 Female White
# ℹ 490 more rows
# ℹ 4 more variables: marital_status <fct>, total_personal_income <int>,
# total_income <int>, income_category <chr>
smoking dataset like 15-25, 26-40, 40-59 and 59+